Performance Analysis and Optimization of Asynchronous
Circuits Produced by Martin Synthesis¹


Steven  M.  Burns 

Computer  Science  Department 
California  Institute  of  Technology 
Pasadena,  CA  91125  USA 

Abstract 

We present a method for analyzing the timing performance of asynchronous
circuits, in particular, those derived by program transformation
from concurrent programs using the synthesis approach developed
by Martin. The analysis method produces a performance metric
(related to the time needed to perform an operation) in terms of
the primitive gate delays of the circuit. Because the gate delays are
functions of transistor sizes, the performance metric can be optimized
with respect to these sizes. For a large class of asynchronous circuits,
including those produced by Martin synthesis, these techniques
produce the global optimum of the performance metric. A CAD tool
has been implemented to perform this optimization.


1  Introduction 

Performance analysis of a synchronous computer system is simplified by an
external clock that partitions the events in the system into discrete segments.
In asynchronous systems, no such quantization exists. Instead, the operation
of the system proceeds at a rate determined by the speed of its individual
components and by the sequencing of their operations. Unlike the
synchronous case, the time needed to perform an asynchronous computation
cannot be determined by merely counting the number of clock cycles
required and multiplying by the clock period. Instead, to determine the time
required to perform the computation as a whole, the times of those individual
components of the computation that must occur sequentially are summed.

¹ Presented at TAU '90, the 1990 ACM International Workshop on Timing Issues in
the Specification and Synthesis of Digital Systems, August 14-17, 1990, Vancouver, BC,
Canada.


The techniques required to analyze asynchronous systems resemble the one
used to determine the clock period of a synchronous system: summing
the delays along the longest path through the combinational logic connecting
adjacent latches. In the clocked case, the critical path has a clear beginning
and a clear end because all paths are broken by latches. No such separation
is available in asynchronous systems. Analysis procedures must deal directly
with cyclic critical paths, and thus existing critical path analysis tools such
as CRYSTAL [9] cannot be easily applied to this problem.

This  paper  discusses  a  framework  for  determining  the  time  needed  to 
perform  computations  using  asynchronous  systems,  and  applies  especially  to 
repetitive  computations.  Previous  work  in  the  area  of  timed  Petri  nets  [10,  5] 
applies  to  this  problem  as  well.  The  results  we  describe  here  are  based  on 
event-rule  systems,  a  different  formalism  that  is  more  closely  connected  to 
the  methods  we  use  to  synthesize  the  asynchronous  systems.  Furthermore, 
we  use  our  formalism  to  model  the  performance  of  asynchronous  circuits  and 
provide  a  method  for  optimizing  such  circuits  for  performance. 

Martin  ([6]  and  elsewhere)  has  developed  a  synthesis  method  whereby 
asynchronous  circuits  are  produced  from  concurrent  program  descriptions. 
By  applying  a  systematic  series  of  semantics-preserving  transformations,  a 
high-level  description  (CSP  program)  is  refined,  using  the  intermediate  forms 
of  handshaking  expansions  and  production  rules,  until  a  provably  correct 
asynchronous  CMOS  circuit  is  constructed. 

At each stage of the synthesis procedure, a variety of transformations
can potentially be applied. In the automated compiler of [1], these choices
are made so that the same subcircuit template can be used to implement
each instance of the same CSP language construct. Instances of these small
templates are composed to form a correct circuit implementing the
original CSP program. However, in order to produce high-performance circuits,
these choices must be directed by performance concerns. We observed
this potential benefit of performance-directed transformations during the design
of the Caltech Asynchronous Microprocessor [7]. The decisions of which
transformations to apply were based on performance goals, and this accounts
for its high performance.

Event-rule (ER) systems can be used at each stage of the synthesis procedure
to analyze the potential performance of the current refinement. Given a
trace of the execution of a complete, closed program (environment included),
an ER system can be generated from any of the intermediate forms: CSP
programs, handshaking expansions, production rules, or CMOS circuits. The
trace of execution is used to unroll each process that contains guarded commands
into a straight-line process. In the cases where the trace of execution
repeats, a repetitive ER system can be generated. The cycle period (the time
between repeated events) can be determined using the techniques explained
in Section 2.

These techniques provide an expression for the cycle period in terms of
maximums and sums of individual component delays. At the circuit level,
the component delays are functions of transistor widths and, as such, the
cycle period can be optimized with respect to these widths. Nonlinear optimization
methods (such as those used in TILOS [3] and EPOXY [8]) can
be used to perform the optimization of this expression for the cycle period.
Our approach differs from those used for synchronous systems because we
optimize all critical paths simultaneously.


2  Event-Rule  Systems 

An event-rule (ER) system is a pair $\langle E, R \rangle$, where:

$E$ is a set of events, and

$R$ is a set of rules defining the timed causal dependencies between the
events. Each $r \in R$ is written $e \stackrel{\alpha}{\mapsto} f$, where

$e \in E$ is the source of $r$,

$f \in E$ is the target of $r$, and

$\alpha \in [0, +\infty)$ is the delay of $r$.

Neither $E$ nor $R$ need be finite. When $R$ is infinite, we require that no event
is the target of an infinite number of rules. Sometimes it is convenient to
view $\langle E, R \rangle$ as a directed graph (multiple arcs and self-loops allowed); this
graph will be referred to as the constraint graph $G$. For a given $\langle E, R \rangle$, there
is a (possibly empty) set of functions $T$ that satisfies:

$T$ is a subset of the functions from $E$ to $[0, +\infty)$;

$t \in T$ if and only if

$$t(f) \ge t(e) + \alpha \quad \text{for every } e \stackrel{\alpha}{\mapsto} f \in R. \tag{1}$$


We call a function $t$ in the set $T$ a timing function of $\langle E, R \rangle$. Each $t$ represents
a possible or consistent timing specification for the events of the system. If
the set $T$ is empty, the constraints (1) cannot be satisfied by any such function
$t$. In this case, $\langle E, R \rangle$ is called infeasible; otherwise, it is called feasible.

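Example 2.1 Consider the $\langle E, R \rangle$ with:

$E = \{a, b, c\}$

$R = \{a \stackrel{\alpha_a}{\mapsto} b,\ b \stackrel{\alpha_b}{\mapsto} a,\ b \stackrel{\alpha_c}{\mapsto} c\}$

This ER system is feasible if and only if $\alpha_a = 0$ and $\alpha_b = 0$.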
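
Feasibility of a finite ER system can be checked mechanically, because the constraints (1) are difference constraints: they can be satisfied exactly when the constraint graph has no cycle of positive total delay. The following is a minimal sketch under that observation. The function name `is_feasible`, the triple encoding of rules, and the numeric delays are illustrative assumptions, and the rule set is one reading of Example 2.1 (a cycle between a and b plus a rule into c).

```python
def is_feasible(events, rules):
    """Return True if some t: events -> [0, inf) satisfies
    t(f) >= t(e) + a for every rule (e, f, a).

    Longest-path relaxation starting from t = 0 everywhere: if values
    still grow after len(events) passes, a cycle with positive total
    delay exists and no timing function can satisfy the constraints.
    """
    t = {e: 0.0 for e in events}
    for _ in range(len(events)):
        changed = False
        for e, f, a in rules:
            if t[e] + a > t[f]:
                t[f] = t[e] + a
                changed = True
        if not changed:
            return True   # t now satisfies every constraint
    return False          # still relaxing: positive-delay cycle


# Assumed reading of Example 2.1; the delay 2.0 on b -> c is made up.
events = ["a", "b", "c"]
zero_cycle = [("a", "b", 0.0), ("b", "a", 0.0), ("b", "c", 2.0)]
positive_cycle = [("a", "b", 1.0), ("b", "a", 0.0), ("b", "c", 2.0)]
print(is_feasible(events, zero_cycle))      # True
print(is_feasible(events, positive_cycle))  # False
```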

The  smallest  timing  function  denotes  the  earliest  time  at  which  the  events 
of  E  can  execute.  Any  feasible  ER  system  with  cyclic  constraints  or  zero  delay 
rules  can  be  transformed  into  an  equivalent  one  that  satisfies  the  hypotheses 
of  Lemma  2.1. 


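Lemma 2.1 If $\langle E, R \rangle$ is feasible, the constraint graph $G$ is acyclic, and
$\alpha > 0$ for every rule in $R$, then there exists a unique function $\hat{t} \in T$ such
that for every $t \in T$,

$$\hat{t}(e) \le t(e) \quad \text{for every } e \in E. \tag{2}$$

We call $\hat{t}$ the timing simulation of $\langle E, R \rangle$.

Proof: We propose the following recursive definition for $\hat{t}$:

$$\hat{t}(f) = \begin{cases} 0 & \text{if } \{e \mid e \stackrel{\alpha}{\mapsto} f \in R\} = \emptyset \\ \max\{\hat{t}(e) + \alpha \mid e \stackrel{\alpha}{\mapsto} f \in R\} & \text{otherwise} \end{cases} \tag{3}$$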


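Such a function is well-defined, because $G$ is acyclic and thus there are no
circular dependencies between the events in $E$.

We show, by contradiction, that this $\hat{t}$ satisfies (2). Pick a $t \in T$ such that the
set $F$ of events $e$ that satisfy $t(e) < \hat{t}(e)$ is non-empty. Let $f \in F$ have the
smallest $\hat{t}(f)$. Because $\hat{t}(f) > t(f) \ge 0$, the event $f$ has at least one
predecessor, so for some $e \stackrel{\alpha}{\mapsto} f \in R$,

$$t(f) < \hat{t}(f) = \hat{t}(e) + \alpha \le t(e) + \alpha \le t(f).$$

The equality of the previous line follows by choosing $e \stackrel{\alpha}{\mapsto} f$ as the rule that
achieves the maximum in (3). The inequality $\hat{t}(e) + \alpha \le t(e) + \alpha$ follows
since $e \notin F$: because $\alpha > 0$, $\hat{t}(e)$ is strictly less than $\hat{t}(f)$. The last inequality holds
by (1). The chain yields $t(f) < t(f)$, a contradiction. Thus, $F$ is empty and $\hat{t}$ satisfies (2). $\blacksquare$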


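Example 2.2 The ER system defined by the constraint graph with arcs

$a \stackrel{\alpha_{ab}}{\mapsto} b, \quad e \stackrel{\alpha_{eb}}{\mapsto} b, \quad b \stackrel{\alpha_{bc}}{\mapsto} c, \quad c \stackrel{\alpha_{cd}}{\mapsto} d$

has the timing simulation:

$$\begin{aligned}
\hat{t}(a) &= 0 \\
\hat{t}(e) &= 0 \\
\hat{t}(b) &= \max(\alpha_{ab}, \alpha_{eb}) \\
\hat{t}(c) &= \max(\alpha_{ab}, \alpha_{eb}) + \alpha_{bc} \\
\hat{t}(d) &= \max(\alpha_{ab}, \alpha_{eb}) + \alpha_{bc} + \alpha_{cd}
\end{aligned}$$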
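
The recursive definition (3) translates almost directly into code. The sketch below is an illustration only: the function name `timing_simulation`, the triple encoding of rules, and the numeric delays are assumptions, and the arcs are those implied by the timing simulation of Example 2.2.

```python
from functools import lru_cache


def timing_simulation(events, rules):
    """Earliest firing times following recursive definition (3).

    rules: iterable of (source, target, delay); the constraint graph
    must be acyclic, otherwise the recursion does not terminate.
    """
    preds = {e: [] for e in events}
    for e, f, a in rules:
        preds[f].append((e, a))

    @lru_cache(maxsize=None)
    def t_hat(f):
        if not preds[f]:
            return 0.0                      # no predecessors: fire at time 0
        return max(t_hat(e) + a for e, a in preds[f])

    return {e: t_hat(e) for e in events}


# Arcs of Example 2.2 with made-up numeric delays.
rules = [("a", "b", 3.0), ("e", "b", 5.0), ("b", "c", 2.0), ("c", "d", 1.0)]
t = timing_simulation("abcde", rules)
print(t)  # {'a': 0.0, 'b': 5.0, 'c': 7.0, 'd': 8.0, 'e': 0.0}
```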

2.1  Repetitive  Systems 

ER systems of unbounded size constructed from finite circuits can be represented
by a finite set of events that are repeated infinitely often. Let the
event set $E$ be generated from the finite set $E'$ by

$$E = E' \times \mathbb{N}.$$

The elements of the finite set $R'$ are quadruples:

$\langle u, v, \alpha, \varepsilon \rangle \in R'$, where $R' \subseteq E' \times E' \times [0, +\infty) \times \mathbb{Z}$,

which we will write as

$$\langle u, i - \varepsilon \rangle \stackrel{\alpha}{\mapsto} \langle v, i \rangle.$$

The set $R$ is the set of all instantiations of the rules $r \in R'$ with $i \ge \max\{0, \varepsilon\}$.

We define the collapsed constraint graph $G'$ of $\langle E', R' \rangle$ as the directed
graph with nodes from $E'$ and arcs from $R'$.
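
The instantiation of repeated rules can be sketched as a small generator. The name `instantiate`, the tuple encoding, and the two-event ring are hypothetical and serve only to show which instances exist for small occurrence indices.

```python
def instantiate(repeated_rules, n):
    """Yield the rules ((u, i - eps), (v, i), alpha) for 0 <= i < n.

    repeated_rules: iterable of (u, v, alpha, eps) quadruples from R'.
    Only instances with i >= max(0, eps) exist, so the source index
    i - eps is never negative.
    """
    for i in range(n):
        for u, v, alpha, eps in repeated_rules:
            if i >= max(0, eps):
                yield (u, i - eps), (v, i), alpha


# Hypothetical two-event ring: a fires b in the same occurrence,
# b fires a in the next occurrence (eps = 1).
ring = [("a", "b", 2.0, 0), ("b", "a", 3.0, 1)]
unrolled = list(instantiate(ring, 3))
for rule in unrolled:
    print(rule)
```

In this ring the event (a, 0) is the target of no instantiated rule, so it has no predecessors in the unrolled system.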

Example 2.3 Consider the repetitive ER system constructed from a circuit containing
a single Muller C-element.

The repeated events (those events generated from $E'$) represent the occurrence
of transitions of circuit variables. The event $\langle x{\uparrow}, i \rangle$ represents the $i$th repetition
(or occurrence) of a transition from $x = \text{false}$ to $x = \text{true}$. Similarly, $\langle x{\downarrow}, i \rangle$
represents the $i$th repetition of a transition from $x = \text{true}$ to $x = \text{false}$. The
repeated rules correspond to dependencies introduced by the inverters and the
C-element that make up the circuit. We can represent the infinite sets $E$ and $R$
graphically:


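[Figure: unrolled constraint graph over the events $\langle x{\uparrow}, 0 \rangle$, $\langle y{\uparrow}, 0 \rangle$, $\langle x{\downarrow}, 0 \rangle$, $\langle y{\downarrow}, 0 \rangle$, ..., with arcs labeled by the delays $\alpha$.]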


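Notice that event $\langle z{\uparrow}, 0 \rangle$ has no predecessors. In the timing simulation, $\hat{t}(\langle z{\uparrow}, 0 \rangle)$
is set to 0. (For ease of notation, $\hat{t}(\langle z{\uparrow}, 0 \rangle)$ will sometimes be written as $\hat{t}(z{\uparrow}, 0)$.)
The entire timing simulation $\hat{t}$, which can be constructed by inspection from the
constraint graph, is:

$$\begin{aligned}
\hat{t}(z{\uparrow}, i) &= p\,i \\
\hat{t}(x{\uparrow}, i) &= \alpha_{x\uparrow} + p\,i \\
\hat{t}(y{\uparrow}, i) &= \alpha_{y\uparrow} + p\,i \\
\hat{t}(z{\downarrow}, i) &= \max(\alpha_{x\uparrow}, \alpha_{y\uparrow}) + \alpha_{z\downarrow} + p\,i \\
\hat{t}(x{\downarrow}, i) &= \max(\alpha_{x\uparrow}, \alpha_{y\uparrow}) + \alpha_{z\downarrow} + \alpha_{x\downarrow} + p\,i \\
\hat{t}(y{\downarrow}, i) &= \max(\alpha_{x\uparrow}, \alpha_{y\uparrow}) + \alpha_{z\downarrow} + \alpha_{y\downarrow} + p\,i
\end{aligned}$$

where $p = \max(\alpha_{x\uparrow}, \alpha_{y\uparrow}) + \alpha_{z\downarrow} + \max(\alpha_{x\downarrow}, \alpha_{y\downarrow}) + \alpha_{z\uparrow}$.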
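
The closed form above can be checked numerically by unrolling the repeated rules and evaluating the timing simulation occurrence by occurrence. This is a sketch under assumptions: the rule list is a reading of the dependencies implied by the timing simulation (each delay labeled by its target transition), `x+` and `x-` stand for the up-going and down-going transitions of x, and the numeric delays are made up.

```python
from functools import lru_cache

# Delays are labeled by the target transition; the values are made up.
alpha = {"x+": 2.0, "y+": 3.0, "z-": 1.0, "x-": 4.0, "y-": 2.0, "z+": 1.0}

# Assumed repeated rules (u, v, eps): z+ enables x+ and y+, which enable z-,
# which enables x- and y-, which enable z+ of the next occurrence (eps = 1).
rules = [("z+", "x+", 0), ("z+", "y+", 0), ("x+", "z-", 0), ("y+", "z-", 0),
         ("z-", "x-", 0), ("z-", "y-", 0), ("x-", "z+", 1), ("y-", "z+", 1)]


@lru_cache(maxsize=None)
def t_hat(v, i):
    """Timing simulation of the unrolled system, by recursion on predecessors."""
    preds = [(u, i - eps) for u, w, eps in rules if w == v and i >= max(0, eps)]
    if not preds:
        return 0.0
    # All rules into v carry the same delay alpha[v], so it factors out of the max.
    return max(t_hat(u, j) for u, j in preds) + alpha[v]


p = (max(alpha["x+"], alpha["y+"]) + alpha["z-"]
     + max(alpha["x-"], alpha["y-"]) + alpha["z+"])
for i in range(4):
    print(i, t_hat("z+", i), p * i)
```

With these delays the period evaluates to 9, and the simulated time of each occurrence of `z+` matches `p * i`.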


2.2  Linear  Timing  Functions 

In the previous example, we saw that the timing simulation of a repetitive
ER system took on a simple form that is linear in the occurrence index $i$.
This is not the case for all repetitive ER systems. However, as we now show,
a linear timing function exists whenever the timing simulation exists, and the
"best" such function will be a good approximation of the timing simulation.
We call $t \in T$ a linear timing function of $\langle E', R' \rangle$ if

$$t(v, i) = x_v + p_v i \quad \text{for every } v \in E' \text{ and } i \in \mathbb{N}. \tag{4}$$

Each $x_v$ and $p_v$ is independent of $i$. For each $v \in E'$, $x_v$ and $p_v$ are called,
respectively, the offset and cycle period of the repeated event $v$.

Because of the linear form of $t$, the timing function constraints, (1), reduce
to linear inequalities in the offsets and cycle periods of the events.
All dependence on the occurrence index $i$ can be eliminated. For each rule
$r = \langle u, i - \varepsilon \rangle \stackrel{\alpha}{\mapsto} \langle v, i \rangle \in R'$, we have the infinite set of constraints:

$$t(v, i) \ge t(u, i - \varepsilon) + \alpha, \quad \text{for each } i \ge \max(0, \varepsilon).$$

Replacing $t$ by its definition (4), we get

$$x_v + p_v i \ge x_u + p_u (i - \varepsilon) + \alpha$$

$$x_v \ge x_u - p_u \varepsilon + \alpha + (p_u - p_v) i.$$

The preceding inequalities can never be satisfied for all $i$ when $p_u > p_v$. Thus,
the infinite set of constraints generated by $r$ can be replaced by the two
inequalities,

$$x_v \ge x_u - p_u \varepsilon + \alpha, \text{ and} \tag{5}$$

$$p_v \ge p_u. \tag{6}$$

From (6) we see that for a feasible solution to exist, a partial ordering
between the $p_v$'s must be satisfied. If two nodes, $u$ and $v$, are in the same
cycle of the collapsed constraint graph $G'$, then $p_u$ must equal $p_v$. All events
in the same strongly connected component of $G'$ have the same cycle period.
In the following, we consider only those repetitive ER systems in which $G'$ is
strongly connected, and $p$ is used to denote the cycle period of every element
in $E'$.
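
With a single cycle period $p$ fixed, the inequalities (5) are difference constraints on the offsets, so a candidate $p$ can be tested by relaxation. The sketch below is illustrative only: the function name `offsets`, the quadruple encoding, the tolerance, and the two-event ring are assumptions, and this is not the special-purpose algorithm of [2].

```python
def offsets(events, rules, p):
    """Find offsets x_v >= 0 with x_v >= x_u - p*eps + alpha for every
    rule (u, v, alpha, eps), or return None if p is too small.
    """
    x = {v: 0.0 for v in events}
    for _ in range(len(events)):
        changed = False
        for u, v, alpha, eps in rules:
            bound = x[u] - p * eps + alpha
            if bound > x[v] + 1e-12:       # small tolerance against round-off
                x[v] = bound
                changed = True
        if not changed:
            return x                        # every inequality (5) holds
    return None  # some cycle has total delay exceeding p times its eps sum


# Hypothetical ring: a -> b (delay 2), b -> a one occurrence later (delay 3).
ring = [("a", "b", 2.0, 0), ("b", "a", 3.0, 1)]
print(offsets("ab", ring, 5.0))   # {'a': 0.0, 'b': 2.0}
print(offsets("ab", ring, 4.0))   # None
```

For this ring the total delay around the only cycle is 5 with one occurrence shift, so offsets exist for p = 5 but not for p = 4.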


2.3  Linear  Programming 

Among the possible linear timing functions, there are those that minimize
the cycle period $p$. The techniques of linear programming [4] can be used to
find such a minimum-period, linear timing function.

The constraints of a linear timing function, (5), are simple linear inequalities
in the $x_v$'s and $p$. By ordering the sets $E'$ and $R'$, we can construct a
linear program in matrix form:

$$\begin{aligned}
\min\ & z = 0^T x + 1^T p \\
& A' x + \varepsilon p \ge \alpha \\
& x, p \ge 0
\end{aligned} \tag{7}$$

The matrix $A'$ is the edge-vertex incidence matrix of the collapsed constraint
graph $G'$. If row $j$ of $A'$ represents the constraint $r_j = \langle u, i - \varepsilon \rangle \stackrel{\alpha}{\mapsto} \langle v, i \rangle \in R'$,
and column $k$ of $A'$ represents the event $u_k \in E'$, then

$$a'_{jk} = \begin{cases} -1 & \text{if } u_k = u \\ 1 & \text{if } u_k = v \\ 0 & \text{otherwise.} \end{cases}$$

The $j$th elements of the (column) vectors $\varepsilon$ and $\alpha$ are the scalar quantities $\varepsilon$
and $\alpha$ of the constraint $r_j$, respectively.
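
The data of program (7) can be assembled from an ordered rule list as follows. This is a sketch with assumed names (`build_program`) and an assumed quadruple encoding; the treatment of self-loops, whose row nets to zero here, is a choice made for the illustration.

```python
def build_program(events, rules):
    """Return (A, eps, alpha) for the constraints A x + eps p >= alpha.

    events: ordered list of E'; rules: ordered list of (u, v, alpha, eps).
    Row j of A has -1 in the column of the source u and +1 in the column
    of the target v (assumption: a self-loop row nets to zero).
    """
    col = {e: k for k, e in enumerate(events)}
    A, eps_vec, alpha_vec = [], [], []
    for u, v, alpha, eps in rules:
        row = [0] * len(events)
        row[col[u]] -= 1
        row[col[v]] += 1
        A.append(row)
        eps_vec.append(eps)
        alpha_vec.append(alpha)
    return A, eps_vec, alpha_vec


# Hypothetical two-event ring, not one of the paper's examples.
A, eps, alpha = build_program(["a", "b"],
                              [("a", "b", 2.0, 0), ("b", "a", 3.0, 1)])
print(A)      # [[-1, 1], [1, -1]]
print(eps)    # [0, 1]
print(alpha)  # [2.0, 3.0]
```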

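Example 2.4 Consider the repetitive ER system:

$E' = \{l_o{\uparrow}, l_i{\uparrow}, r_o{\uparrow}, r_i{\uparrow}, l_o{\downarrow}, l_i{\downarrow}, r_o{\downarrow}, r_i{\downarrow}\}$

$$\begin{aligned}
R' = \{\ & \langle l_i{\downarrow}, i-1 \rangle \stackrel{\alpha_{l_o\uparrow}}{\mapsto} \langle l_o{\uparrow}, i \rangle, \\
& \langle r_o{\downarrow}, i-1 \rangle \stackrel{\alpha_{l_o\uparrow}}{\mapsto} \langle l_o{\uparrow}, i \rangle, \\
& \langle l_i{\uparrow}, i \rangle \stackrel{\alpha_{l_o\downarrow}}{\mapsto} \langle l_o{\downarrow}, i \rangle, \\
& \langle r_i{\downarrow}, i \rangle \stackrel{\alpha_{r_o\uparrow}}{\mapsto} \langle r_o{\uparrow}, i \rangle, \\
& \langle l_o{\downarrow}, i \rangle \stackrel{\alpha_{r_o\uparrow}}{\mapsto} \langle r_o{\uparrow}, i \rangle, \\
& \langle r_i{\uparrow}, i \rangle \stackrel{\alpha_{r_o\downarrow}}{\mapsto} \langle r_o{\downarrow}, i \rangle, \\
& \langle l_o{\uparrow}, i \rangle \stackrel{\alpha_{l_i\uparrow}}{\mapsto} \langle l_i{\uparrow}, i \rangle, \\
& \langle l_o{\downarrow}, i \rangle \stackrel{\alpha_{l_i\downarrow}}{\mapsto} \langle l_i{\downarrow}, i \rangle, \\
& \langle r_o{\downarrow}, i-1 \rangle \stackrel{\alpha_{r_i\downarrow}}{\mapsto} \langle r_i{\downarrow}, i \rangle, \\
& \langle r_o{\uparrow}, i \rangle \stackrel{\alpha_{r_i\uparrow}}{\mapsto} \langle r_i{\uparrow}, i \rangle\ \}
\end{aligned}$$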


In  this  case,  equation  (7)  becomes: 
The duality theorem of linear programming relates the primal program
(7) to the dual program:

$$\begin{aligned}
\max\ & w = y^T \alpha \\
& y^T A' \le 0^T \\
& y^T \varepsilon \le 1^T \\
& y \ge 0
\end{aligned} \tag{8}$$

If both the primal and the dual programs have optimal solutions, then the
optimal value $z$ of the primal equals the optimal value $w$ of the dual. We
solve this dual program (8) in order to determine the cycle period.


2.4  Cycle  Vectors  of  a  Graph 

A cycle of length $\ell$ in a directed graph $\mathcal{G} = (V, \mathcal{E})$ (multiple edges and self-loops
allowed) is an ordered subset $C = (c_0, c_1, \ldots, c_{\ell-1})$ of the edges $\mathcal{E}$ such
that $\mathrm{target}(c_{k-1}) = \mathrm{source}(c_k)$ for all $0 < k < \ell$ and $\mathrm{target}(c_{\ell-1}) = \mathrm{source}(c_0)$.
The cycle $C$ can be represented by a cycle vector $u$, a $\{0, 1\}$-vector of length
$|\mathcal{E}|$, where $u_j = 1$ if and only if the $j$th edge of $\mathcal{E}$ is in the set $C$. For each
cycle vector $u$, $u^T A' = 0^T$, where $A'$ is the edge-vertex incidence matrix of
the graph $\mathcal{G}$. The following lemma relates the cycle vectors to an arbitrary
vector $y$ satisfying $y^T A' \le 0^T$.

Lemma 2.2 Let $U_i$, $0 \le i < q$, denote the cycle vectors of a graph with
edge-vertex incidence matrix $A'$. Then, if $y \ge 0$ is such that $y^T A' \le 0^T$,
there exist scalars $\theta_i \ge 0$, $0 \le i < q$, such that

$$y = \sum_{i=0}^{q-1} \theta_i U_i. \tag{9}$$

Proof: See [2] for complete proof. Follows by induction on the number of
cycles in the graph of $A'$. $\blacksquare$
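
The identity $u^T A' = 0^T$ for a cycle vector says that every node on the cycle is entered as often as it is left. The following sketch checks this on a small graph; the function names and the graph are hypothetical.

```python
def cycle_vector(edges, cycle):
    """{0,1}-vector marking which edge indices belong to `cycle`."""
    members = set(cycle)
    return [1 if j in members else 0 for j in range(len(edges))]


def times_incidence(u, edges, nodes):
    """Compute u^T A' without forming A': selected in-edges minus out-edges per node."""
    net = {n: 0 for n in nodes}
    for uj, (src, dst) in zip(u, edges):
        net[src] -= uj      # -1 entry in the source column
        net[dst] += uj      # +1 entry in the target column
    return [net[n] for n in nodes]


# Hypothetical graph: edges 0, 1, 2 form the cycle a -> b -> c -> a;
# edge 3 is an extra arc a -> c.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
u = cycle_vector(edges, [0, 1, 2])
print(u)                                  # [1, 1, 1, 0]
print(times_incidence(u, edges, nodes))   # [0, 0, 0]
```

A vector that selects only edge 0 is not a cycle vector, and its product with the incidence matrix is nonzero.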

This lemma provides a straightforward means of determining the minimum
cycle period $p$. By enumerating every cycle in $G'$, and computing the
delay around that cycle, we can find $p$.

Theorem 2.3 The minimum cycle period $p$, or equivalently, the optimal
value $w$ of the dual program (8), is

$$p = \max\left\{ \frac{U_i^T \alpha}{U_i^T \varepsilon} \;\middle|\; \text{for all cycle vectors } U_i \right\}. \tag{10}$$

Proof: Let $U$ be the cycle matrix constructed by concatenating the (column)
cycle vectors $U_0, U_1, \ldots, U_{q-1}$. By construction, $U^T A' \le 0$ (actually
equality). By Lemma 2.2, any $y \ge 0$ with $y^T A' \le 0^T$ can be represented as
the product $U\Theta$, where the vector $\Theta$ has non-negative elements. The dual
program (8) reduces to:

$$\begin{aligned}
\max\ & z = \Theta^T (U^T \alpha) \\
& \Theta^T (U^T \varepsilon) \le 1 \\
& \Theta \ge 0
\end{aligned}$$

The dual of the reduced dual is easily solved:

$$\begin{aligned}
\min\ & z = p & (11) \\
& (U^T \varepsilon)\, p \ge (U^T \alpha) & (12) \\
& p \ge 0 & (13)
\end{aligned}$$

The smallest scalar $p$ that satisfies the vector inequality (12) yields the
desired minimum cycle period. $\blacksquare$
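
Theorem 2.3 suggests a direct, if expensive, computation: enumerate the simple cycles of the collapsed constraint graph and take the largest ratio of total delay to total occurrence shift. The sketch below uses a plain depth-first enumeration, which is exponential in general and is not the special-purpose algorithm of [2]. The function name, the quadruple encoding, and the example rules are assumptions.

```python
def min_cycle_period(events, rules):
    """Max over simple cycles of (sum of alpha) / (sum of eps).

    rules: list of (u, v, alpha, eps). Assumes every cycle has a
    positive eps sum; otherwise the division below fails.
    """
    order = {e: k for k, e in enumerate(events)}
    out = {e: [] for e in events}
    for u, v, alpha, eps in rules:
        out[u].append((v, alpha, eps))
    best = 0.0

    def extend(start, node, visited, a_sum, e_sum):
        nonlocal best
        for nxt, alpha, eps in out[node]:
            if nxt == start:
                # Closed a cycle whose lowest-ordered node is `start`.
                best = max(best, (a_sum + alpha) / (e_sum + eps))
            elif nxt not in visited and order[nxt] > order[start]:
                extend(start, nxt, visited | {nxt}, a_sum + alpha, e_sum + eps)

    for s in events:
        extend(s, s, {s}, 0.0, 0)
    return best


# Hypothetical system: ring a -> b -> a (ratio 5/1) plus a self-loop on a (ratio 4/1).
ring = [("a", "b", 2.0, 0), ("b", "a", 3.0, 1), ("a", "a", 4.0, 1)]
period = min_cycle_period(["a", "b"], ring)
print(period)   # 5.0
```

Restricting each search to nodes ordered after the start node ensures that every simple cycle is visited exactly once.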


Example  2.5  The  minimum  cycle  period  of  the  previous  example  can  be 


con- 


The  collapsed  constraint 

graph  G'  is: 
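In the graphical representation, each constraint $r$ is labeled with $\alpha_r - \varepsilon_r p$. Summing
the delays along the three cycles through $G'$,

$C_0 = (l_o{\uparrow}, l_i{\uparrow}, l_o{\downarrow}, r_o{\uparrow}, r_i{\uparrow}, r_o{\downarrow})$,

$C_1 = (l_o{\uparrow}, l_i{\uparrow}, l_o{\downarrow}, l_i{\downarrow})$ and

$C_2 = (r_o{\uparrow}, r_i{\uparrow}, r_o{\downarrow}, r_i{\downarrow})$,

yields:

$$\begin{aligned}
p_0 &= \alpha_{l_o\uparrow} + \alpha_{l_i\uparrow} + \alpha_{l_o\downarrow} + \alpha_{r_o\uparrow} + \alpha_{r_i\uparrow} + \alpha_{r_o\downarrow} \\
p_1 &= \alpha_{l_o\uparrow} + \alpha_{l_i\uparrow} + \alpha_{l_o\downarrow} + \alpha_{l_i\downarrow} \\
p_2 &= \alpha_{r_o\uparrow} + \alpha_{r_i\uparrow} + \alpha_{r_o\downarrow} + \alpha_{r_i\downarrow}
\end{aligned}$$

By Theorem 2.3, $p = \max(p_0, p_1, p_2)$.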

2.5  Approximating  the  Timing  Simulation 

We  now  show  that  a  minimum-period  linear  timing  function  provides  an 
accurate  approximation  to  the  timing  simulation. 

Theorem 2.4 Let $t$ and $\hat{t}$ be a minimum-period linear timing function and
the timing simulation, respectively, of a connected repetitive ER system.
There exists a finite $B$ such that for all $u \in E'$ and all $i \ge 0$,

$$s_{u,i} = t(u, i) - \hat{t}(u, i) \le B.$$

Proof: By definition, for each $u$ and $i$,

$$t(u, i) = x_u + p\,i$$

$$\hat{t}(u, i) = x_u + p\,i - s_{u,i}.$$

Each $s_{u,i}$ is nonnegative because $\hat{t}$ is the smallest timing function. For the
constraints generated from $r = \langle u, i - \varepsilon \rangle \stackrel{\alpha}{\mapsto} \langle v, i \rangle \in R'$, we define the nonnegative
slack variables $z_{r,i}$ and $z_r$, thus transforming inequalities into equalities:

$$x_u - p\,\varepsilon + \alpha + z_r = x_v \tag{14}$$

$$x_u + p\,(i - \varepsilon) - s_{u,i-\varepsilon} + \alpha + z_{r,i} = x_v + p\,i - s_{v,i} \tag{15}$$

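By subtracting these equations and simplifying, we get

$$z_r - z_{r,i} = s_{v,i} - s_{u,i-\varepsilon}. \tag{16}$$

From Theorem 2.3, $p \sum_{r \in C} \varepsilon_r = \sum_{r \in C} \alpha_r$ for at least one cycle $C$. Adding
the constraints on $t$, (14), for each $r \in C$, we see that

$$\sum_{r \in C} x_{u_r} - p \sum_{r \in C} \varepsilon_r + \sum_{r \in C} \alpha_r + \sum_{r \in C} z_r = \sum_{r \in C} x_{v_r}.$$

Since along any cycle $\sum_{r \in C} x_{u_r} = \sum_{r \in C} x_{v_r}$, and every $z_r$ is nonnegative, we have for all $r \in C$,

$$z_r = 0.$$

By (16), $s_{u,i-\varepsilon} \ge s_{v,i}$ for all $i \ge \max(0, \varepsilon)$ and all $u$, $v$ on cycle $C$. By summing
along the cycle $C$, we see that for each $u \in C$ and $i' \ge 0$,

$$s_{u,i'} \ge s_{u,i} \quad \text{where } i = i' + \sum_{r \in C} \varepsilon_r.$$

Therefore, we can bound $s_{u,i}$, for every $u \in C$, by

$$s_{u,i} \le B' = \max\Bigl\{\, s_{u,i'} \Bigm| u \in C \wedge i' < \sum_{r \in C} \varepsilon_r \Bigr\}.$$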


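For any event $h$ not on cycle $C$, we find a path $P_h$ to this event from
an event $g$ on $C$. Because $G'$ is strongly connected, such a path must exist
and be independent of $i$. Then, by summing (16) along that path, we get for
all $i, i' \ge 0$

$$s_{g,i'} + \sum_{r \in P_h} z_r \ge s_{h,i},$$

where $i = i' + \sum_{r \in P_h} \varepsilon_r$. But $\sum_{r \in P_h} z_r$ is independent of $i$; thus, $s_{h,i}$ is bounded
by a quantity that does not increase with successive occurrences. Thus, every
$s_{h,i}$ with $h \notin C$ is bounded by $B$, where $B'$ is the bound on the slacks of the events on $C$,

$$B \ge \max\Bigl\{\, s_{h,i} \Bigm| h \notin C \wedge i < \sum_{r \in P_h} \varepsilon_r \Bigr\}$$

and

$$B \ge B' + \max_{h \notin C} \sum_{r \in P_h} z_r. \qquad \blacksquare$$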
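
The bounded gap between a minimum-period linear timing function and the timing simulation can be observed on a small system. Everything in this sketch is hypothetical: the three rules, the delays, and the hand-chosen offsets. The two cycles have ratios 3/1 and 6/3, so the minimum period is 3.

```python
from functools import lru_cache

# Hypothetical repeated rules (u, v, alpha, eps).
rules = [("a", "a", 3.0, 1), ("a", "b", 5.0, 1), ("b", "a", 1.0, 2)]
p = 3.0
x = {"a": 0.0, "b": 2.0}   # offsets chosen by hand; they satisfy (5) for p = 3


@lru_cache(maxsize=None)
def t_hat(v, i):
    """Timing simulation of the unrolled system."""
    times = [t_hat(u, i - eps) + alpha
             for u, w, alpha, eps in rules if w == v and i >= max(0, eps)]
    return max(times, default=0.0)


def slack(v, i):
    """s_{v,i} = t(v, i) - t_hat(v, i) for the linear function t = x_v + p*i."""
    return x[v] + p * i - t_hat(v, i)


slacks = {(v, i): slack(v, i) for v in "ab" for i in range(6)}
print(slacks)
```

In this system only the first occurrence of `b` fires earlier than the linear function predicts; all later slacks are zero, so the slack never exceeds 2.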


3  Performance  Optimization 

Using  the  above  analysis  method,  a  performance  metric  (the  minimum  cycle 
period  p)  can  be  expressed  in  terms  of  the  primitive  delays  of  an  ER  system. 
If  the  ER  system  is  modeling  a  CMOS  circuit,  these  primitive  delays  are 
determined  by  transistor  widths.  Adjusting  the  transistor  widths  affects 
the  performance  metric,  but  the  nature  of  the  dependence  is  completely 
encapsulated  by  the  expression  for  the  minimum  cycle  period.  Minimizing 
the  expression  for  p  in  terms  of  the  transistor  widths  yields  an  optimally 
sized  circuit. 

3.1  Tau  Model 


Figure  1:  Linear  approximation  of  a  CMOS  pulldown. 


A simple RC switch model is used to relate each individual delay $\alpha$ to the
widths of the circuit's transistors (the $w$'s). Each transistor is modeled as a switch
with a resistance inversely proportional to its width. The gate of a transistor
has a capacitance to ground proportional to its width. Source and drain
capacitances are also proportional to transistor widths. Thus, the delays
between $a{\uparrow}$ and $z{\downarrow}$, and $b{\uparrow}$ and $z{\downarrow}$, of the circuit shown in Figure 1 are
modeled as:

$$\begin{aligned}
\alpha_{a\uparrow z\downarrow} &= R_1 C_1 + (R_1 + R_2) C_2 \\
\alpha_{b\uparrow z\downarrow} &= (R_1 + R_2) C_2 \\
R_1 &= \rho_n / w_1 \\
R_2 &= \rho_n / w_2 \\
C_1 &= K_{ds} (w_1 + w_2) \\
C_2 &= K_{ds} (w_2 + w_3) + C_{wiring} + K_g (w_4 + w_5),
\end{aligned}$$

where $\rho$ is a constant that describes the differing per-unit-width strengths
of the n- and p-channel transistors, $K_{ds}$ is the per-unit-width capacitance
contributed by the drain or source terminals, $K_g$ is the per-unit-width gate
capacitance, and $C_{wiring}$ is the capacitance contributed by wiring. All capacitances
are expressed in terms of transistor width and thus $K_g = 1$. The
delays (the $\alpha$'s) are in units of $\tau$, the time needed for a unit-width n-channel
transistor to switch a unit-width load. (Thus, $\rho_n = 1$, $\rho_p > 1$.) The values
of $K_{ds}$ and $C_{wiring}$ are not constant, but depend on the final circuit layout,
which depends weakly on the transistor widths. This dependence is normally
small and is ignored in the optimization problem.
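
As an arithmetic illustration of the model, the two pulldown delays can be evaluated for a given set of widths. The function name and the default values of `k_ds` and `c_wiring` are placeholders, not values from the paper; `rho_n = 1` and `k_g = 1` follow the normalization stated above. The comments naming the internal and output nodes are a reading of the formulas, since the figure itself is not reproduced here.

```python
def pulldown_delays(w, rho_n=1.0, k_ds=0.5, k_g=1.0, c_wiring=2.0):
    """Delays (in units of tau) of the two-transistor pulldown model.

    w: widths (w1, w2, w3, w4, w5). k_ds and c_wiring are placeholder
    values chosen only for this example.
    """
    w1, w2, w3, w4, w5 = w
    r1 = rho_n / w1
    r2 = rho_n / w2
    c1 = k_ds * (w1 + w2)                               # assumed: internal node
    c2 = k_ds * (w2 + w3) + c_wiring + k_g * (w4 + w5)  # assumed: output node
    alpha_a = r1 * c1 + (r1 + r2) * c2
    alpha_b = (r1 + r2) * c2
    return alpha_a, alpha_b


print(pulldown_delays((2.0, 2.0, 4.0, 1.0, 1.0)))   # (8.0, 7.0)
```

Doubling `w1` and `w2` halves both resistances but raises `c1` and `c2`, which is the trade-off that the sizing optimization resolves.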

3.2  Convex  Objective  Function 

Every $\alpha$ derived using this simple model (and also many more accurate ones)
is a posynomial function (a polynomial with positive coefficients and positive
variables) of the transistor widths $w$, and thus a convex function of the
$\log w$'s [4]. Because both the sum and the maximum of two convex functions
are convex functions, the resulting expression for $p$ is a convex function of
the $\log w$'s; thus, each minimum of $p$ is global. The addition of convex
constraints, for example, to limit energy usage or to bound transistor sizes,
does not alter this global-minimum property.

We have implemented a program for solving the resulting nonlinear, nondifferentiable,
convex optimization problems based on the subgradient techniques
described by Shor [11]. Table 1 lists the results of this program when
applied to a variety of circuits. The column $n_{\text{trans}}$ denotes the number of
transistors in the circuit, and thus the number of free variables in the optimization
problem. The columns $p_{\text{unsized}}$ and $p_{\text{sized}}$ show the cycle period, in
units of $\tau$, of the circuit before and after optimization. In the unsized case,
all transistors have equal sizes. The CPU column denotes the number of
CPU seconds needed to compute the optimum value on a Sun SPARCstation
1. For the circuits in Table 1, the cycle period of the sized circuit is 20 to 33 percent
shorter than that of the unsized circuit. A direct implementation of the optimization
algorithm requires $O(n_{\text{trans}}^{[\ldots]} + n_{\text{cycles}}\,\ell_{\max})$ arithmetic operations per iteration,
where $n_{\text{cycles}}$ is the number of cycles used to form the cycle period function
and $\ell_{\max}$ is the maximum number of edges per cycle. A more sophisticated
implementation that does not require enumeration of all cycles is described
in [2] and requires only $O(n_{\text{trans}}^{[\ldots]})$ arithmetic operations per iteration. In
this case, the linear program for the cycle period is solved directly, at each
iteration, by a special-purpose algorithm.
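
The structure of the sizing problem, a maximum of posynomial cycle delays minimized over the logarithms of the widths, can be illustrated with a plain normalized subgradient method. This is not the method of Shor [11] used by the tool, only a simpler stand-in. The two cycle-delay expressions, the step-size rule, and all names below are hypothetical.

```python
import math


def cycle_delays(y):
    """Two hypothetical cycle delays, posynomial in w = exp(y), with gradients in y."""
    w1, w2 = math.exp(y[0]), math.exp(y[1])
    c1 = w1 + w2 / w1 + 64.0 / w2
    c2 = 2.0 * w1 + 8.0 / w1
    # Partial derivatives with respect to y1 = log w1 and y2 = log w2.
    g1 = (w1 - w2 / w1, w2 / w1 - 64.0 / w2)
    g2 = (2.0 * w1 - 8.0 / w1, 0.0)
    return (c1, g1), (c2, g2)


def minimize_period(y, iterations=2000):
    """Normalized subgradient descent on p(y) = max of the cycle delays."""
    best = math.inf
    for k in range(iterations):
        # The gradient of the largest delay is a subgradient of the max.
        value, grad = max(cycle_delays(y), key=lambda cg: cg[0])
        best = min(best, value)
        norm = math.hypot(*grad)
        if norm == 0.0:
            break
        step = 0.5 / math.sqrt(k + 1)       # diminishing step size
        y = [yi - step * gi / norm for yi, gi in zip(y, grad)]
    return best


p_unsized = max(c for c, _ in cycle_delays([0.0, 0.0]))   # all widths equal to 1
p_sized = minimize_period([0.0, 0.0])
print(p_unsized, round(p_sized, 2))
```

For these made-up delays the first cycle can never be shorter than 12 (by the arithmetic-geometric mean inequality, with equality at w1 = 4 and w2 = 16), so the sized period approaches 12 from above, starting from 66 at unit widths.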


4  Summary 

We have demonstrated a method for determining the performance of circuits
described by event-rule systems. Furthermore, we have shown how to optimally
size transistors in such circuits. What we have not shown, due to lack
of space, is how to transform the specifications of asynchronous circuits that
we use for synthesis into ER systems. With the addition of these techniques,
we have a complete method combining synthesis and performance analysis.
The performance analysis can be done early and at each level of the synthesis
procedure and can be used to guide the synthesis of efficient circuits. A
complete description is given in [2].

Circuit                            n_trans   p_unsized   p_sized   CPU (s)

Three stage pipeline control          59        189        143        42
Ten stage pipeline control           192        189        151       190
Ten stage pipeline control*          192        189        151        95
Simple microprocessor control*       285        646        430       369

* indicates results generated by the special-purpose algorithm

Table 1: Performance of optimization tool.